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ABSTRACT 

Markov Chain Monte Carlo (MCMC) techniques are now widely used for cosmological pa- 
rameter estimation. Chains are generated to sample the posterior probability distribution ob- 
tained following the Bayesian approach. An important issue is how to optimize the efficiency 
of such sampling and how to diagnose whether a finite-length chain has adequately sampled 
the underlying posterior probability distribution. We show how the power spectrum of a single 
such finite chain may be used as a convergence diagnostic by means of a fitting function, and 
discuss strategies for optimizing the distribution for the proposed steps. The methods devel- 
oped are applied to current CMB and LSS data interpreted using both a pure adiabatic cos- 
mological model and a mixed adiabatic/isocurvature cosmological model including possible 
correlations between modes. For the latter application, because of the increased dimension- 
ality and the presence of degeneracies, the need for tuning MCMC methods for maximum 
efficiency becomes particularly acute. 
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1 INTRODUCTION 

The availability of high quality data from both CMB (Hinshaw et 
al.; Kogut et al. 2003) and large-scale structure (Percival et al. 2001 ; 
Tegmark et al. 2003) experiments has allowed the field of precision 
cosmology to advance rapidly in the last few years. Methods of 
cosmological parameter estimation are allowing us to narrow down 
the range of possible universes by placing bounds on the param- 
eters describing a particular model. Because many of the models 
have a large number of parameters, ranging in complexity from the 
simplest scale-invariant ACDM cosmology to those including spa- 
tial curvature, massive neutrinos, a dark energy equation of state, 
cosmic strings, or a non-adiabatic contribution, it has become com- 
monplace to use Markov Chain Monte Carlo (MCMC) methods 
to explore the probability distributions of high dimensionality ob- 
tained following Bayesian statistics. 

MCMC methods were first introduced in the 1950s (Metropo- 
lis et al. 1953) to sample an unknown probability distribution ef- 
ficiently and are described in detail in Neal (1993) and in Gilks, 
Richardson & Spiegelhalter (1995). Instead of calculating the prob- 
ability density at sites on a regular grid spanning the entire param- 
eter space, one draws samples sequentially according to a proba- 
bilistic algorithm. The sequence of states visited forms a Markov 
Chain distributed according to the probability distribution to be ex- 
plored. Rather than scaling exponentially with the number of pa- 
rameters varied, the time needed to sample a distribution grows ap- 



proximately linearly with dimension. For this reason MCMC meth- 
ods are particularly useful for cosmological parameter estimation. 
This application has explicitly been discussed by Christensen et 
al. (2001), Knox, Christensen & Skordis (2001), Lewis & Bridle 
(2002), Verde et al. (2003), Tegmark et al. (2003) and others. 

All MCMC algorithms share the property that asymptotically 
the distribution of states visited by the chain is identical to the un- 
derlying distribution. However, it is critical to be able to determine 
whether and to what extent a finite length chain is a fair sample 
and can be used to make confident and accurate estimates of statis- 
tics characterizing the underlying distribution. This idea of 'conver- 
gence' to a stationary distribution has been discussed extensively 
in the statistical literature (see e.g. Cowles & Carlin 1996 for a 
review), and there exist a range of convergence tests that can be ap- 
plied to MCMC output, many of which are described in the CODA 
manual (Best, Cowles & Vines 1995). Both Heidelberger & Welch 
(1981,1985) and Geweke (1992) use spectral methods to determine 
the total running length of a simulation and the length of an ini- 
tial transient to be discarded. Gelman & Rubin (1992) study the 
dispersion of the means of multiple chains, while Raftery & Lewis 
(1992) use second order Markov chains to insure that percentiles 
are estimated within a given accuracy. In the cosmological context, 
the package of CODA tests were first applied by Christensen et al. 
(2001), with the Gelman & Rubin test further explored by Verde et 
al. (2003) and included with the Raftery & Lewis test in the COS- 
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MOMC package (Lewis & Bridle 2002). Methods for speeding up 
the convergence have been considered by Slosar & Hobson (2003), 
in the COSMOMC package and by Doran & Mueller (2003). 

Here we describe a convergence test for MCMC methods that 
we have tested and found to work well for several CMB parame- 
ter estimation problems. The test tells us when one can stop run- 
ning the MCMC chain and use it to estimate statistics of the under- 
lying distribution. We draw upon various techniques incorporated 
in existing convergence analyses, studying the spectral behaviour 
generic to chains generated with the Metropolis algorithm. 

This article is organized as follows. Section 2 reviews the 
Metropolis algorithm and section 3 defines the convergence of a 
chain. Section 4 formulates a spectral convergence test based on fit- 
ting to a template, which we have empirically found to work well, 
to the sample power spectra of finite length chains. This section 
demonstrates the effectiveness of the proposed spectral test using 
Gaussian distributions of various dimensions and differing ratios 
of the covariance of the trial distribution to that of the underly- 
ing distribution. Section 5 explores how to optimize the trial dis- 
tribution for the proposed steps, tailoring its covariance to that of 
the underlying distribution. In section 6 we investigate what can 
go wrong, considering various non-Gaussian distributions, ranging 
from mildly non-Gaussian to more pathological. Section 7 summa- 
rizes our method in the form of an explicit 'recipe,' which is then 
applied to CMB+LSS data in section 8. Finally, we conclude with 
a discussion in section 9. 



2 THE METROPOLIS ALGORITHM 

The Metropolis algorithm, used to sample a probability distribu- 
tion p(x) in one dimension, works as follows. Starting at an initial 
position xq, we generate a sequence of points xi,X2, ■ ■ ■ accord- 
ing to the following rule. is generated from Xi by attempt- 
ing a trial step x tr iai distributed according to a trial distribution 
qtriai(xtriai,Xi), typically but not always chosen with the spe- 
cial form qtriai(xtriai,Xi) = q tr iai{xtriai — Xi). The trial dis- 
tribution must be symmetric and chosen such that all points with 
p(x) / can be connected by the chain. Because of considera- 
tions of detailed balance, the symmetry of q tr iai ensures that p(x) 
is stationary under the Markov process. We shall typically use for 
qtriai(xtriai — Xi) a. Gaussian of vanishing mean and adjustable 
variance a%. If Pr — p(x tr iai) /p(xi) ^ 1, then x i+1 is set to 
xtriai with probability one. Otherwise, x i+ i is set to x trial with 
probability Pr and to Xi with probability (1 — Pr). If the chain 
moves to xtriai, the step has been 'accepted'. Otherwise, we say 
that it has been rejected. The correlated chain of steps Xi explores 
the full range of the sample space spanned by p(x) such that even- 
tually, in the infinitely long chain limit, the distribution of points 
visited is exactly described by the distribution p(x). Most of the 
time is spent exploring regions of high likelihood, but all regions 
where p(x) is non-zero are eventually explored. 

To sample a D-dimensional distribution p(x) using the 
Metropolis algorithm, x trial is replaced by x tr iai, where (x tr iai — 
Xi) is distributed according to a multivariate Gaussian of zero mean 
and covariance matrix Ct- This D-dimensional step x tria i can be 
drawn from a Gaussian distribution of non-diagonal covariance by 
drawing D Gaussian random samples y with unit variance and gen- 
erating Xtriai = C T 1 '" 2 y, where Cj, 1 ' 2 is the positive definite 
matrix square root. 



3 DEFINING CONVERGENCE 

The distribution of points visited by the infinite MCMC chain de- 
scribed above is described exactly by the underlying probability 
density p(x). Statistical properties of the distribution such as the 
mean, median, and quantiles can therefore be calculated directly 
from the chain. In practice, one must use a chain of finite length to 
estimate these statistical properties of the underlying distribution. 
Errors arise from the truncation to a finite chain both because of 
shot noise and because of correlations between successive elements 
of the chain. Following the literature, we shall say that a finite chain 
has 'converged' when its statistical properties, suitably defined, re- 
flect those of the underlying distribution p(x) with 'sufficient accu- 
racy' that the chain can be terminated. To determine whether con- 
vergence has been achieved requires answering the following two 
questions: 

1 . Has the chain fully traversed the region of high probability in 
such a way that the correlations between successive elements of 
the chain does not bias the inferred distribution for p(x)7 Apart 
from shot noise, a chain that is too short may explore only a single 
peak or portion of a single peak where p{x) is significant. 

2. Can we then estimate suitably defined statistics about the under- 
lying distribution p(x) with sufficient accuracy? 

In the next Section we show how to extract from a single chain 
the information about the large-scale correlations of the chain that 
indicate whether the first requirement above has been satisfied. To 
address the second question, a level of accuracy for a given statis- 
tic must be specified. Let /i and o% be the mean and variance of 
the underlying distribution p(x), respectively. A useful diagnostic 
is the variation of the sample means x obtained from a finite chain. 
Let cr| (iV) indicate the variance of the sample mean, defined by 
averaging over independent realizations of a finite chain of length 
N. We assume a starting point determined according to the under- 
lying distribution, in other words that any initial transients of the 
chain have decayed away. The variance in the sample mean may be 
characterized by the dimensionless ratio 

r=^|, (1) 

which we shall call the 'convergence ratio'. We require r to lie be- 
low some cutoff value, chosen for example to be 0.01. The sample 
mean variance has been used in various ways in the literature as a 
diagnostic of convergence, as outlined in Cowles & Carlin (1994). 
Heidelberger & Welch calculate confidence intervals on the mean 
using the ratio a^/x, but do not use the distribution variance. The 
Gelman & Rubin test incorporates a similar ratio: their R statistic 
roughly translates to R ~ 1 + r, but the quantity is calculated us- 
ing multiple parallel chains. In the next section we show how to 
estimate the convergence ratio r from a single chain. 

If a chain is started far outside the region of high probability, 
an initial section of the chain consisting of steady progression into 
this region will be unrepresentative of the underlying distribution 
and must be discarded. In the literature this initial transient has been 
dubbed 'burn-in.' We use the ratio of the probability p compared to 
the maximum p ma x as an indicator of how much of the beginning 
of the chain to chop out, for example while p/p ma x < 0.1. Such 
truncation is unnecessary if the chain is started from a point already 
known to lie in the region of high probability, for example from an 
earlier chain. 
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4 SPECTRAL ANALYSIS OF MCMC CHAINS 

An infinite one-dimensional Gaussian random chain for which the 
underlying statistical process is independent of position may be ex- 
panded into Fourier coefficients according to 

r dk 



x(k) e 



(2) 



where the reality of the x n implies that x(k) = x* (— k). The two- 
point correlations can be characterized in terms of the Fourier co- 
efficients 



(x(k) x*(k')) =S(k-k') P(k), 



where the power spectrum P(k) is an even function. Any correla- 
tion linking unequal k and k' would contradict the hypothesis of 
position independence. 

We may define the two-point autocorrelation (i.e., the chain 
variance) 



Co = (x n ) 



(4) 



giving an average measure of the power spectrum. Similarly, the 
correlation with an offset TV is given by 



Cjv = {x n Xu+n) 



dk 

■P P(k) cos(fcTV). 

Z7T 



We consider the sample mean 

1 N-l 

n=0 

the variance of which is given by 

1 f+* dk sin 2 [TVfc/2] 



(x N ) 



P(k) 



TV J_ n 2nN sin 2 [fc/2] 
for a chain with zero mean. Since for all integers TV > 0, 
dk sin 2 [TVfc/2] _ 



2ttTV 



'■[k/2] 



(5) 



(6) 



(7) 
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eqn. is a weighted average of P(k) that becomes more and more 
concentrated around k = as TV becomes large. Because 

1 sin 2 [TVfc/2] 



lira 

n^oo 2nN 



'[k/2] 



S(k), 



(9) 



(x 2 N ) 



P(k = 0). 



(10) 



it follows that for large TV 
1 

TV 

Consequently, estimating the sample mean variance of a long chain 
is equivalent to estimating P(k) at k = 0. 

In practice, we want to estimate the power spectrum P(k) 
from a finite chain of length TV. To do so we define the derived 
random variables 



N-l 

—= x n exp \i2w(jn/N) 

V J ' n 



(11) 



where [j = -(N/2 - 1), -(N/2 - 2), ... , -1, 0, +1, ... , 
+ (TV/2 — 1), +(N/2)] and N is even and considered fixed. These 
coefficients are the discrete Fourier transform of the chain. The TV 
new variables a J N result from a unitary transformation acting on the 
original chain and are Gaussian and independently distributed, with 



1 

TV 



dk sin' 



2nN sin 2 [(fc-fc f )/2] 



where kj = 2n(j/N). 

The variables a J N can be used to estimate the power spectrum 
P(kj) from a single realisation of a finite length chain. We first 
calculate the discrete power spectrum Pj of the single finite chain 



P - \a +j \ 2 

3 — \ a N I ' 



(13) 



for j = l,... N/2 - 1. 

Adopting the approximation on the far right of Eqn. i 1 2t as 
exact, we find that Pj/P(kj) obeys a \ 2 distribution with two de- 
grees of freedom. We find that 



(3) (H^])=l n [P(%)]-m[2]+.A 1 



(14) 



and 

, 7T 2 

VarflnP,- ) = A 2 - (Aif = — w 1.645 (15) 

6 

where 

1 r°° 

Ai = - / dx ln[a;] exp[-a:/2] = In [2] - 7 » 0.1159 (16) 

2 Jo 

and 



Az 



dx Intel exp[-a;/2] = h (In [2] - -yf (17) 

6 



where 7 ~ 0.577216 is the Euler-Mascheroni constant. 

P(kj) can therefore be inferred by fitting ln[Pj] to template 
power spectra using least squares, providing an estimate for P(k = 
0). 



4.1 MCMC Power Spectra 

An ideal sampler draws from the underlying distribution with no 
correlations between successive elements of the chain. The power 
spectrum is absolutely flat, with P(k) = a 2 for all k. Here a is the 
standard deviation of the underlying distribution. This is a white 
noise spectrum. Fig.Qplots the Pj obtained from such a chain. In 
this case, satisfying the convergence criterion r < 0.01, requires 
only 100 steps. 

An actual MCMC chain always has correlations on small 
scales due to the nature of the Metropolis algorithm. If the trial dis- 
tribution is chosen such that very small trial steps are attempted, the 
chain propagates diffusively, behaving locally like a random walk. 
For large trial steps, on the other hand, the chain remains stuck 
at one point for quite some time before accepting a step. The ini- 
tial and final points of these jumps are almost uncorrelated, but the 
jumps occur infrequently. A balance between these two extremes 
through a judicious choice of the trial distribution minimizes the 
correlations between successive elements of the chain. Once the 
chain has fully travelled around the region of high probability, the 
correlations at the largest scales begin to vanish, and the large-scale 
behaviour starts to mimic an ideal sampler. 

If we examine the power spectrum of an actual chain, we ob- 
serve a white noise spectrum on large scales where k is small, turn- 
ing over to a spectrum with suppressed power at large k. The po- 
sition of the knee where the white noise turns over to suppressed 
power with a different power law reflects the inverse correlation 
length. We illustrate this behaviour with an MCMC chain sampling 
a five-dimensional Gaussian model with zero mean and unit vari- 
ance in each dimension, sampled with an MCMC chain of length 
TV = 3000 using the Metropolis algorithm. The trial distribution is 
a Gaussian of width gt ~ 1.1, which we later show to be optimal. 
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Figure 1. (Top) The discrete power spectrum of an 'ideal,' uncorrelated 
chain formed by drawing points at random from a Gaussian distribution of 
unit variance. (Bottom) The discrete power spectrum from an MCMC chain 
of length N = 3000 sampling the same five-dimensional Gaussian. 
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Figure 2. The exact power spectrum of one of the variables of an MCMC 
chain sampling the same 5-D Gaussian distribution as in the lower panel of 
Fig-0 measured by averaging over a large number of chains. 



The five-dimensional chain better illustrates the effect of correla- 
tions on the power spectrum than a less correlated one-dimensional 
chain. The power spectrum of one of the components of the chain 
is shown in the lower panel of Fig.Q To remove the noise we simu- 
late 5000 identical chains and find the average of the sample spectra 
as shown in Fig. [2] Except at the very smallest scales, the small- 
scale behaviour is well approximated by a power-law of the form 
P(k) oc k~ a , with a typically in the neighbourhood of but not 
exactly equal to 2, which would correspond to the spectrum of a 
perfect random walk. 



4.2 A parametric fit to the power spectrum 

In analysing the power spectrum of a chain of finite length, we 
would like to answer two questions: (1) Whether the white noise 
part of the spectrum has been reached for the lowest accessible k 
(i.e., k = j(2iv/N) where j is a small nonzero positive integer). 
Note that an estimate at k — is not accessible because we almost 
always lack independent knowledge of the mean of the underlying 
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Figure 3. In the top panel, the chain of the lower panel in Fig. is fitted to 
the template defined in Eqn. 1181 . This fit is indicated by the dashed curve. 
The lower panel compares the obtained fit to the measured exact power 
spectrum (shown as the solid curve). 

distribution. (2). What is our estimate for P(k — 0) so that the 
variance of the sample mean can be estimated? 

Question (1) is not really a problem involving hypothesis 
testing — that is, testing the hypothesis that the spectrum is in fact 
exactly white noise — because at the outset we know that at any fi- 
nite k the spectrum is not exactly flat. Instead the relevant question 
is whether one is probing k small enough so that the asymptotic de- 
viation from white noise present on the largest scales is sufficiently 
small. Because of this, what we really want to do is fit to a template 
that models the transition from white noise at the largest accessible 
scales to correlated behaviour on small scales. 

An appropriate template has the following form: 



P{k) = P - 



(18) 



{k*/k) a + r 

Here the free parameter Po gives the amplitude of the white noise 
spectrum in the k — » limit. The parameter k* indicates the posi- 
tion of the turnover to a different power law behaviour, character- 
ized by the free parameter a, at large k. This template models the 
observed behaviour more closely than for example a polynomial fit 
used by Heidelberger & Welch (1981) to estimate P(k = 0) from 
the power spectra of discrete event simulations. 

For a chain of length N, our power spectrum data consists of a 
single realization of the random variables Pj , j = 1, . . . , N/2 — 1, 
defined in eqn. Jl 31 . For the parametric model defined by the above 
template, 

In Pj = ln[P(fc, )] - ln[2] + Ai + r 3 

. r „, , T (Nk*/2nj) c 
= ln[P ]+lri V 11 



- 7 + Tj 



(19) 



_ 1 + (Nk* /2nj)° 

where rj are random measurement errors characterized by the ex- 
pectation values (ri) = 0, {riTj) — <5y7T 2 /6. We fit InPo, k* and 
a using least squares, over the range of Fourier modes 1 ^ j ' ^ 
jmax- For a spectrum which turns over at j* — k* (N/2n), an ap- 
propriate limit is jmax ~ 10 j*, which gives equal weighting to 
the two power law regimes in log space and avoids using the very 
highest j values which have small scale artefacts. In practice the 
limit j m ax = 1000 is typically used as a starting point, but a sec- 
ond iteration may be used to get a better fit once j* is known. The 
results are fairly insensitive to the exact value, but we find Po to be 
overestimated when jmax 3> Wj*. 

We now test how well the functional form of this template fits 
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Figure 4. (Left) Distribution for the quality of fit P^ zt /Po, obtaining P^ zt 
by fitting 5000 individual chains, compared to the true value Pq obtained 
by averaging. (Right) Quality of fit Pq/Po at the lower la level as a 
function of j*, the number of Fourier modes in the white noise regime. The 
dashed line indicates the mean value for j* > 20, with very short chains 
tending to overestimate Po . 
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Figure 5. True (solid) and fitted (dashed) power spectra of chains sampling 
the distribution as in Fig. [2] using trial distributions of widths <r T /<ro = 0.2 
(left), 0.5 (middle) and 2 (right). 
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the exact power spectra of the MCMC chains. We fit the template 
initially to the Pj of a single chain of length TV = 3000, as shown 
in Fig. [3] obtaining the parameters Po = 16.6, a — 1.95 and 
k* = 0.14, corresponding to j* — 64, and a sample mean variance 
estimate <x§ = Pq/N — 0.0055. The discrete power spectra Pj 
are then evaluated and averaged over a large number of long chains 
to measure the exact spectrum P(k). The lower panel of Fig. [5] 
compares the fit of a single chain to this exact power spectrum. 

Since Po,a and k* are inferred from a single finite-length 
chain, not necessarily very far into the white noise regime, errors 
will be introduced in the sample best-fit parameters compared to 
the ideal fit to an infinitely long chain. These errors can be esti- 
mated by simulating a large number of finite chains, and fitting the 
template to each individual spectrum to measure the dispersion of 
the best-fit parameters. We wish to avoid underestimating Po, in 
order to prevent diagnosing convergence too early. For 5000 chains 
of length TV = 3000, we find P ( f %t = 17 ± 4 (quoting the median 
and 68% confidence limits). The exact value obtained from the av- 
erages of all the parallel chains is Po = 16. In Fig.[4]we show the 
distribution of Pq'* /Po, with a lower la (16th percentile) limit of 
0.8. The accuracy of the template fit for varying chain lengths is 
then checked by measuring how P /Po (at the lower la level) 
varies as a function of the number of Fourier modes in the white 
noise regime j* , plotted in the left panel of Fig. [4] For j* <; 20, 
the estimate of Po has little scatter. For smaller j* the tendency is 
to overestimate Po . 

We now investigate the quality of our template fitting proce- 
dure when the width ar of the trial distribution is sub-optimal. We 
test the goodness of fit of the template to the true power spectrum, 
as well as obtaining estimates for the possible errors on the best- 
fit parameters. The procedure described above is repeated with the 
same underlying distribution, but sampling with a trial distribution 
of varying width. In Fig.[3]we show the averaged power spectra for 
chains generated using the range of step-sizes ot /o"o = 0.2,0.5,2, 
with a best-fit template for comparison. Their power spectra all 
have the same form and the template fits well, with the dispersion 
of the best-fit parameters given in TableQ The possible underesti- 
mate for Po is in the range 0.7 < P /Po < 0.8 at the la level, 
which is suitably accurate for our purposes. The slope of the curve 
at small-scales is given consistently by a ~ 2, corresponding to 
near random-walk behaviour. 



Table 1. Quality of template fit for chains sampling with various step sizes 
<jt I Co ■ The distributions are obtained by fitting the power spectra of mul- 
tiple chains to derive the median and 68% limits for the best-fit variables 
In Po, a an d Ink*. We give the physical quantities Po, a and k", with the 
true Pq obtained by averaging for comparison. 

4.3 Testing for Convergence 

Any finite chain can be tested for convergence once the variables 
Po, a and k* are obtained for each parameter separately. The fol- 
lowing two requirements are made: 

1. kmi n must be in the white noise regime P(k) ~ k°, defined con- 
cretely by the requirement j* > 20. This insures that the correlated 
points are not biasing the distribution and indicates that the chain 
is drawing points throughout the full region of high probability. 

2. The convergence ratio r = a^/ag is calculated using the esti- 
mate for Po, with r = Po/N for chains normalized to have unit 
variance. To obtain statistics with good accuracy, r < 0.01. 

As an example Fig. [3] shows the spectra of a chain sampling the 
model distribution discussed in the previous section at three stages. 
After 500 steps the chain is still correlated at the largest scales 
measured and has not yet visited the entire distribution. It fails the 
convergence test immediately since P(k) is not constant at large 
scales, with j* — 5. After 1200 steps the chain has entered the 
white-noise regime at large scales with best-fitting j* = 20, but 
has not yet drawn enough samples to get good statistics; r = 0.02 
and the test is failed. Finally after TV = 2500 steps the power spec- 
trum is white noise at large scales, with j* = 54. The best-fitting 
Po = 17.5 gives a convergence ratio r = 0.007, and the test is 
passed. 

Once all the parameters have passed the test we are confident 
that the chain has converged and could stop it, but may choose to 
run it for longer to get more samples to reduce shot noise from 
the histograms. This is particularly relevant for lower dimensions 
where the chain can converge after relatively few steps. In prac- 
tice we find that for high dimensional chains (D <; 8), there are 
enough samples by the time the test is passed. This would not be 
the case if we thinned the chain (i.e., saving only every mth point 
in the chain). This corresponds to cutting out small scale correla- 
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Figure 6. Discrete power spectra, with the best-fit template (solid) 
and true (dashed) power spectra for a chain of varying length N = 
500, 1200, 2500, from unconverged (left) to fully converged (right). 
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Figure 7. Power spectrum (left) and resulting histogram (right) of a chain 
sampling a model Gaussian distribution (top), compared to the spectrum 
and histogram of the thinned chain (bottom) where kt = 0.1. 



tions with k > k t , where k t = n/m. The effect of thinning on the 
power spectrum and the recovered distribution is shown in Fig.0 
Not only is the sample mean variance increased by thinning, but 
the loss of points is very noticeable in the shot noise of the his- 
tograms, particularly in higher dimensions. Since the convergence 
test assures us that the correlated points are not biasing the chain 
output, they are not removed. 



5 OPTIMIZING EFFICIENCY AND PREDICTING 
CONVERGENCE LENGTHS 

Because of limited computational resources, it is of great impor- 
tance to maximize the efficiency of an MCMC chain through a well 
chosen trial distribution for the attempted steps. As discussed pre- 
viously, too small trial steps lead to random walk behaviour, lead- 
ing to large inter-step correlations and hence slow convergence, 
whereas with too large trial steps jumps occur too infrequently, 
likewise leading to large correlations and slow convergence. In this 
section we investigate how to choose an optimal trial distribution 
between these two extremes. 

The efficiency of a chain may be quantified as follows. Let 
cr| (A^) be the variance of the sample mean from chains of length 
TV. The dimensionless efficiency E of an MCMC chain is defined 
(see e.g. Neal 1992) by comparing its sample mean variance to an 
ideal chain (i.e., completely uncorrelated) in the long chain limit: 



E : 



hm 



(20) 



Here erg is the variance of the underlying distribution, and o-q/N 
gives the sample variance of an ideal chain of length N. E^ 1 there- 
fore gives the factor by which the MCMC chain is longer than 
an ideal chain yielding the same performance. A chain closest to 
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Figure 8. Dependence of the inverse efficiency 1 /E on the trial distribution 
step-size <tt / cro (left) and the acceptance fraction /a (right) for a chain 
sampling a one-dimensional Gaussian model. 



ideal will have minimum correlation and therefore maximum effi- 
ciency. The efficiency is related trivially to the number of steps 7V C 
needed to give a convergence ratio of r at N c — (rTS)" 1 steps, 
since r = al(N)/o%. It follows that E = a 2 /P(k = 0). Previ- 
ous work on choosing an optimal step-size for the trial distribution 
includes Gelman, Roberts & Gilks (1996) and Hansen & Cunning- 
ham (1998), who investigate optimizing the efficiency for multi- 
variate Gaussian underlying distributions. 

We first consider how best to sample an underlying 73- 
dimensional Gaussian distribution 

P{v) = 77. \xr>io exp[-3?72ag] (21) 



where x — [x\, . . . ,xd), choosing a trial distribution 
1 



q(y) 



(2-KGT 2 ) D / 2 



ex p[-y /2o- T ]. 



(22) 



We calculate the efficiency as a function of cjt by running 10 
independent chains of length N started with a point chosen at ran- 
dom according to p(x). For 73 = 1 optimal efficiency is attained 
at <tt/(7o = 2.4 (see Fig.[8} at a maximum efficiency of 0.22, giv- 
ing a convergence length N c ~ 450 for r — 0.01. The fraction of 
steps accepted is Ja ~ 0.4. In the diffusive, random walk regime, 
where (ot/o"o) <C 1, almost all trial steps are accepted and the 
inverse efficiency scales as 1/E oc {cjt / 'oo)~ 2 ■ At the other ex- 
treme (or/co) 3> If almost no steps are accepted and the inverse 
efficiency scales as 1/E oc ar/cro, which is also proportional to 
the acceptance rate. For D — 1, it is clearly better to err on the side 
of too large trial steps. 

For higher dimensions (D > 1) the results are shown in 
Fig. [5] The optimal efficiency follows a power law 



E 



3.373' 



(23) 



which works reasonably well for all 73 > 1 at an optimal step-size 

(24) 



a T /cr w 2.4/V73. 

This translates into an optimal convergence length (for r = 0.01) 
of 7V e w 330 73. We also found, as in Gelman et al. (1996), that 
the acceptance rate of the optimal chain tends to /a =0.25 for 73 
greater than about four. We also observe that in higher dimensions 
it becomes progressively more important to choose or not to large, 
because the efficiency for large steps scales as E oc (cto/ctt) , 
as illustrated in Fig. [5] The theoretical notion described above that 
MCMC chain length scales linearly with dimension is only true 
for optimal cjt, which is much harder to achieve in high dimen- 
sions. Fig. 1101 shows the power spectra for optimal chains in 4, 8 
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Figure 9. Dependence of the inverse efficiency E on the trial distribution 
step-size \Z~Dut /uo (left) and the acceptance fraction (right) for a 
chain sampling a D = 1 (bottom), 2, 4, 8 and 16 (top) dimensional Gaus- 
sian model. 
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Figure 10. Variation in power spectra with dimension for 4 (bottom), 8 and 
12 (top) dimensional toy Gaussian models, sampled with a chain of length 
N = 5000 using the optimal step-sizes obtained from Fig.l9l 



and 12 dimensions. It is clear that in higher dimensions the corre- 
lation length increases, seen as a lower wavenumber k* at which 
scale-free behaviour begins and leading to a higher Po and reduced 
efficiency. 

In the general multivariate case, where 



p(x) 



det" 1/2 [C] exp 



1 t n - 
--x C 



(25) 



(2tt) d / 2 

where C is the covariance matrix, choosing an acceptable trial 
distribution amounts to choosing the D(D + l)/2 independent 
elements of the covariance matrix for the trial distribution Ct 
with sufficient accuracy. From simple rescaling, it is apparent that 



D 


^ ideal 




4 


40 


530 


8 


110 


2900 


16 


230 


12000 



Table 2. Number of steps needed to estimate the covariance matrix of a 
D-dimensional Gaussian distribution, such that the square root of all eigen- 
values are correct to within 25%. N i( i ea i corresponds to an uncorrected 
sampler; N mcmc = Nid £a i/E provides an estimate for an MCMC chain 
with inverse efficiency 1/E ~ 3. 3D. 



Ct = (o"t/o-o) 2 C is the optimal choice of trial distribution 
where (<jt/<Jq) is chosen as for the previous special case where 
C = oil. 

Fig. IH shows how the efficiency varies when Ct oc C but a 
sub-optimal scaling factor <jt /(To is used. It is quite likely that the 
two distributions will not be proportional, and in this case we can 
expect a great reduction in the efficiency. To illustrate the scenario 
in two dimensions we take a univariate trial distribution Ct to sam- 
ple a bivariate underlying distribution with covariance C. If this 
distribution has widths differing by a factor of ten (e.g. by taking 
Cn = 1, C22 = 100), the optimum inverse efficiency attainable 
is 1/E — 50, found by varying <tt and calculating the efficiency 
as described earlier. It compares badly with the optimal 1 /E ~ 7.4 
when Ct oc C, and the chain takes seven times longer to converge 
than if a proportional trial distribution were used. In higher dimen- 
sions, sampling correlated underlying distributions with parameters 
a few orders of magnitude apart, the cost of using a non-optimal 
trial distribution can be even more extreme. 

Without prior knowledge of the underlying covariance C, at 
least one preliminary chain will be necessary to obtain a reasonable 
estimate for the trial distribution. The number of steps required for 
such a chain can be estimated by first considering an uncorrelated 
chain, where points are drawn at random from a Gaussian distri- 
bution. To obtain its covariance matrix with sufficient accuracy we 
would like the square root of all its eigenvalues to be within ~ 25% 
of their true values. By varying the lengths of such chains in var- 
ious dimensions we find the length Nid ea i where this criterion is 
satisfied, shown in table [2] For an MCMC chain with efficiency 
E, a conservative estimate for the number of steps needed is then 
approximately Nm Cmc m N/E. These chain lengths are given in 
table|2|for optimal efficiency E w 3. 3D. Preliminary chains with 
very low efficiency would need to be significantly longer, so an it- 
erative method to improve the covariance matrix estimate may be 
used. 

So far we have considered the chain length N c needed to sat- 
isfy the convergence criterion cr| < O.OIct 2 . We are often more 
interested however in limiting the extremes of a distribution, deter- 
mining what part of the parameter space has been ruled out say at 2 
or 3cr. Here 'shot noise' or Poisson counting statistics becomes the 
primary limiting factor in determining the fraction of points (and 
hence integrated probability) with probability p < p c , when p c is 
small. If there were on average n points in the region p(x) < p c 
(for n N), the standard deviation in the count fluctuation would 
be o n = \fn, giving a fractional error 8n/n = 1/s/n. This is true 
for an uncorrelated chain. For an MCMC chain of length iV with 
efficiency E, the equivalent number of independent points is ap- 
proximately EN, so a conservative estimate of the fractional error 
becomes Sn/n = 1/ V En. To achieve a given accuracy requires 
a chain with 1 / E more points than an uncorrelated chain. We test 
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that this expression gives an upper bound on the fractional error by 
studying the fluctuation in the number of counts obtained beyond 
both the 2 and 3a limits for a Gaussian model of dimension D, us- 
ing multiple chains. We find Sn/n < 1/V En to hold for a large 
range of chain lengths over the range 1 < D < 16. 



6 WHAT CAN GO WRONG: THE WORST CASE 
SCENARIO 

To apply these methods to practical parameter estimation, we must 
know how the power spectrum shape and the relation linking the 
covariance of the underlying distribution to that of the trial distri- 
bution for the optimal chain are altered for non-Gaussian underly- 
ing probability distributions models, and whether it is possible for 
the convergence test to fail. Clearly examples can be constructed 
where the power spectral convergence test described above (as well 
as competing tests) give the appearance that the underlying distri- 
bution has been adequately sampled when this is not at all the case. 
We consider a range of possible distributions, from mildly non- 
Gaussian to more pathological distributions. 

6.1 Mild non-Gaussianity 

We consider two non-Gaussian models with probability densities 

Pi( x ) = ^[p( x > ff o = 1) + p(x,a = 2)j (26) 

P2{x) = i[p(a;,CTo = 1) + p(x,a = 4) J (27) 

where p(x, ao) is the D-dimensional Gaussian in eqn. |2H . We 
investigate whether the power spectra of the resulting chains are 
well fit by our template and how the optimal trial step-size relation 
is altered. Both two- and eight-dimensional models are considered 
to include possible high-dimensional effects, and the efficiency is 
calculated as a function of step-size, \TDot / '(To- 

Fig. II ll shows the dependence of the inverse efficiency on the 
trial step-size and the acceptance fraction /a, for both distributions 
pi(x) and P2(x). In two dimensions, little difference is observed 
in the behaviour for either distribution compared to the Gaussian 
case shown in solid lines. The optimal step-size for the underlying 
distribution pz(x) is slightly higher, with a correspondingly lower 
acceptance rate. The efficiencies are very similar, although a chain 
with high acceptance fraction (with a smaller step-size) is more 
strongly penalised. In eight dimensions the underlying distribution 
Pi(x) shows similar behaviour to the Gaussian with optimal effi- 
ciency reduced by a factor ~ 1.2, but the more non-Gaussian p2 (x) 
shows more extreme behaviour, with the inverse efficiency tightly 
peaked around the optimal step-size, with an efficiency lower by a 
factor of ~ 2 and a significant penalty for non-optimal sampling. 
The choice of ctt/oo = 2A/\fT) that one might make without 
prior knowledge of the form of these distributions would be too 
large for this eight dimensional case, but still produces reasonably 
efficient chains for both distributions considered. In the lower pan- 
els of Fig. \H\ the smoothed power spectra for both distributions 
are shown, displaying the same spectral behaviour as the Gaussian 
chains. 
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Figure 11. (Top) Dependence of the inverse efficiency 1/E on the trial 
step-size or / <ro and acceptance rate Ja for two non-Gaussian distribu- 
tions, pi(x) (dot-dashed) and P2{x) (dashed), compared to the Gaussian 
model (solid) in two dimensions (lower curves) and eight dimensions (up- 
per curves). (Bottom) Smoothed power spectra for chains sampling the dis- 
tributions pi(x) (dot-dashed) andp2(^') (dashed), in D = 2 and D = 8 
dimensions. 



dimensional distribution 



p(xi, X2) ~ A exp 



(z^W-i) 2 



8(7,? 



(28) 



where the normalization A is chosen appropriately and ay <C lj or 
the more realistic non-symmetric distribution 



p(x 



1,X2 



A exp 



(x 1 2 +X2 2 



8<r 2 



2aj' 



(29) 



where <jg <C 1 but ay < a e . The first distribution is concen- 
trated over an annulus of width oy around the unit circle. To tra- 
verse the region of high probability rapidly, one would want to at- 
tempt small steps in the radial direction but comparatively larger 
steps in the azimuthal direction. Locally, this could be achieved 
by a elongated bivariate Gaussian in the coordinates xi,X2\ how- 
ever, as one moves around, the axis of elongation would have to ro- 
tate. However, in order to satisfy detailed balance, so that the chain 
reproduces the underlying probability distribution, it is necessary 
that qtriai{xtriai, Xi) remain constant throughout the simulation; 1 
therefore, an adaptive algorithm that allows C't to evolve during 
the simulation is excluded. 

In the second example, the simply connected region of high 
probability has the shape of a thin crescent (shown in Fig. ll2L A 
simple change of variable can in principle solve the problems de- 
scribed above. The pathology of this distribution can be character- 
ized by the dimensionless parameter r\ — ag/a r , which gives the 
approximate number of standard deviations by which the region of 
highest probability is displaced from the two-dimensional mean. 



6.2 Curved distributions 

For D ^ 2 a potential problem arises when the region of high 
probability is elongated and curved. Consider for example the two- 



1 Clearly a more complicated form for qtrial{ x trial> x i) n °t simply de- 
pending on the difference (x tria [ — asj) can alleviate this slowdown, but for 
the general case finding an appropriate distribution qtrial ma Y he difficult. 
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Figure 12. Two-dimensional probability distribution given by eqn. 1291 
with variables xi,X2, for r) = 1 (left), r) = 5 (centre), and (right) the 
same distribution with transformed variables y\ = x\ + — 1> D2 = %2- 
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Figure 13. Variation in inverse efficiency X/E with step-size ctt/cto, for 
the distributions shown in Fig. 1 121 with curvature parameters r] = 1 and 
r\ = 5 (upper curves) as defined in the text. Parameter x\ is dot-dashed, X2 
is dashed, compared to the transformed parameters yi (solid). 



Any change of variable that renders the region convex (for example 
one which locally resembles polar coordinates) makes the underly- 
ing distribution sufficiently close to Gaussian in the new variable, 
so that the previously described methods are adequate. Here for ex- 
ample the new parameters j/i = x\ + x\ — 1, yi = a; 2 transform 
the crescent to a bivariate Gaussian. 

We investigate the effect of using the original coordinates on 
the efficiency of a chain sampled using the standard Gaussian trial 
distribution. We take two examples with r\ = 1 and r\ = 5, whose 
distributions are shown in Fig. 1121 following the same procedure 
as described in section 5 to measure the efficiency as a function 
of the step-size. As the distributions are no longer symmetric we 
use Ct — or/coC, measuring a separate efficiency for each pa- 
rameter. In Fig. 1 131 these results can be compared to the optimal 
efficiency for a Gaussian, obtainable using the transformed coordi- 
nates (solid line). For the mildly curved distribution with r\ = 1 the 
efficiency is simply reduced by a factor ~ 1.5, but the optimal step- 
size is unchanged. The more extreme r\ = 5 case shows markedly 
different behaviour however, with the less efficient x\ parameter 
fifteen times slower than the optimal Gaussian case, for a step-size 
twice as big and a far lower acceptance rate, }a ~ 0.05. The high 
curvature and long tails make it impossible to converge quickly us- 
ing a Gaussian trial distribution, since the trial and underlying dis- 
tributions overlap so badly. Low acceptance rates combined with 
diffusive steps result in highly correlated chains. In practice this 
slowdown is readily diagnosed using our spectral test while running 
the chain. Coupled with a tendency towards a very low acceptance 
rate for a naive optimal step-size ot /&o ~ 2.4/V^D, this would be 
a clear indication of the need to reparametrize. 

In high dimensions it is often difficult to identify an appropri- 
ate change of variables empirically, although Kosowsky, Milosavl- 



jevic & Jimenez (2002) and Jimenez et al. (2004) describe such 
a transformation of a set of simple-inflationary cosmological pa- 
rameters into a set of nearly-orthogonal 'physical' parameters. In 
practice however, we have found the non-transformed cosmologi- 
cal parameters to be sufficiently well-behaved for flat cosmologies, 
achieving high efficiencies. 

6.3 Bimodal or multi-peaked distributions 

Failure of the spectral convergence test is most likely to occur 
for distributions having multiple narrow peaks of high probabil- 
ity connected by wide regions of low probability. While the length 
•^single peak required to sample adequately the region around a sin- 
gle peak may be rather short, the length iVtunnci required to tunnel 
to other peaks may be substantially longer. 

The simplest example is a symmetric bimodal distribution 



p(x) = 



exp 



Or- 



- exp 



(x + a) 



(30) 



with variable peak position at x = ±a. We wish to know if the 
spectral convergence test can fail for this model by indicating con- 
vergence before the full distribution is recovered. The behaviour of 
the chain will depend on both a and the trial step-size or/co- We 
explore this dependency and its effect on the chain power spectrum 
by sampling from the distribution with variable peak separation a, 
using a fixed trial distribution width <tt- 

The observed behaviour, shown in Fig.^| can be split roughly 
into three categories. For small peak separation a the chain jumps 
between the two peaks frequently, converging toward both peaks si- 
multaneously. The power spectrum only displays large-scale white- 
noise behaviour at late times when both peaks are fully sampled. 
The template fits the spectrum well at early and late times, shown 
in Fig. 1141 (top two rows). If a is increased sufficiently, then the 
chain may converge 'locally' on one of the peaks before it takes its 
first jump to the second peak, shown in the third row of Fig.ll4lfor 
a = 6. Before it makes this jumps the chain would pass the spectral 
convergence test, but as soon as it swaps peaks an excess of power 
is produced on large scales (4th row). It now fails the convergence 
test, although the template does not provide a very good fit. After a 
much longer time (6th row) the chain will have sampled both peaks 
sufficiently to converge 'globally', recovering the correct distribu- 
tion with peaks of equal height. The chain is no longer biased to- 
wards one of the peaks, so again the power becomes white-noise on 
large scales and the convergence test is passed. For large separation 
a the chain can sample one peak without jumping to the second 
peak in any reasonable time. This is observed in Fig. 1 141 (bottom 
row) for a — 8. For a length ten times longer than Nc it does not 
jump, and we would falsely conclude that the distribution had only 
one peak. The same effect is seen if the peak separation is narrower 
(e.g. a — 6) but the trial step-size is very small. 

From these models we can conclude that if the chain visits a 
second peak even briefly, then it will show up as excess large-scale 
power in the power spectrum and the chain will not pass the con- 
vergence test until it has sampled both peaks sufficiently. This is 
useful since in high dimensions a jump might not be so obvious in 
the time-ordered chain output. If the peaks are very widely sepa- 
rated (or if an overly small step-size is being used) then the chain 
might not leave the first peak in a reasonable time. If there are just 
two peaks then multiple chains may be used to diagnose this prob- 
lem, but for multiple disconnected peaks this could be insufficient, 
although this is an unlikely scenario for the case of cosmological 
parameters. 
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Figure 14. Behaviour of chains with fixed trial distribution sampling a bi- 
modal distribution with peak positions x = ±4 (top 2 panels) x = ±6 
(middle 4 panels) and x = ±8 (bottom panel). The chain output (left), re- 
covered distributions (centre) and power spectra (right) are shown for vari- 
ous chain lengths n. 



Finally we investigate how the optimal scaling relations are 
altered if the distribution is bimodal, considering a range of peak 
separations from a = 2 to a — 10. Using the same methods de- 
scribed earlier, we find that the optimal step-sizes are similar to 
the Gaussian case, shown in Fig. 1151 although slightly reduced to 
ot /(To ~ 2. It is important to note that ao applies here to the full 
distribution, not to a single peak. The optimal acceptance rate de- 
creases as the peak separation increases, to /a ~ 0.1 for a = 10. 
The inverse efficiency then gradually increases as the chain be- 
comes more correlated as a consequence of steps occurring more 
infrequently. The unusual behaviour of the chain getting stuck in- 
definitely on one peak, seen in the final row of Fig. m 

occurs for 

step-sizes smaller than those considered here. 
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Figure 15. Variation in inverse efficiency l/E with trial step-size cry / <ro 
and acceptance rate for a one-dimensional bimodal distribution with 
Gaussian peaks at x = ±a for a = (bottom), 2, 4, 6, 8 and 10 (top). 



7 SAMPLING METHOD 

This section outlines an explicit sampling procedure based on the 
methods and considerations presented in the previous sections. This 
is the procedure that we applied to extracting cosmological param- 
eters from the CMB+LSS data. 

1. An initial best guess is made for the covariance matrix of the 
underlying distribution Ci, which is used to fix the multivariate 
Gaussian trial distribution characterized by the covariance matrix 
Cti, chosen according to the rule 



(2A 2 /D)d 



(31) 



If possible, a starting point in the high likelihood region is chosen. 

2. A short chain is sampled using Cti to obtain a refined estimate 
for C2, which in turn is used to update Ct according to the same 
rule in eqn. 13 II If this first chain started in a region of low likeli- 
hood, the first section (where L/L max <J 0.1) is discarded before 
estimating C2 ■ If the acceptance rate for this initial chain is very 
low (e.g. <J 0.01) or very high (e.g. <; 0.9), then this first chain 
does not provide a useful estimate of the covariance matrix and a 
new guess should be made for C\ . 

3. A second chain is started with trial distribution Ct2, starting 
where the previous chain finished. The process is iterated until 
further refinement would not lead to a sufficiently significant im- 
provement in efficiency. The efficiency compared to 'optimal' can 
be judged using both the acceptance rate as an indicator, and the 
power spectrum test to give an estimate for Po . For real, approxi- 
mately Gaussian distributions tested for dimension D <^ 8 we have 
found that one update of the trial distribution to Ct 2 normally suf- 
fices to sample efficiently. 

4. Only the final chain is used for the analysis, which is tested for 
convergence once N > N c . The best-fit template for the power 
spectrum of each cosmological parameter is found, to give Po, oe 
and k*. The test is passed if k m in is in the regime P(k) oc k°, 
defined by the condition j* > 20, and if the convergence ratio 
r < 0.01 for each parameter. If the white noise-regime has been 
reached but r > 0.01, an estimate can be made for how much 
longer the chain needs to run using r oc 1/N. 

6. Once the chain has converged, the histograms of the number den- 
sities are checked to insure sufficient points have been collected, 
and further sampling may be carried out if a characterisation of the 
wings of the distribution at high a are desired. 

7. If a series of updates of the trial distribution fails to improve 
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Figure 16. MCMC chain for a pure adiabatic cosmological model, showing 
the chain output (left), power spectra (middle) and marginalized distribu- 
tions (right) for three parameters of a 7-dimensional model, sampled with a 
chain of length N = 9000. 
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the efficiency or produce convergence, often signalled by a very 
low acceptance rate, it is likely that the distribution is highly non- 
Gaussian and reparametrization should be explored. 



Figure 17. MCMC chain for a mixed adiabatic/isocurvature cosmological 
model. Power spectra (left) and marginalized distributions (right) for four 
parameters of the 16-dimensional model: the baryon density ojj, optical 
depth t and relative power in two non-adiabatic modes, neutrino velocity 
(NIV) and correlated adiabatic/neutrino density isocurvature (A, N I D). 



8 APPLICATION TO CMB PARAMETER ESTIMATION 

We applied the above methods to estimating cosmological pa- 
rameters using the CMB and LSS data and code as described 
in Bucher et al. (2004). Initially we consider a 7-dimensional 
pure adiabatic cosmological model defined by the parameters 
u)d, Q.a ,n 3 ,T,/3, and an overall amplitude A s . An appropriate 
trial distribution and high-likelihood starting point are easy to ob- 
tain given already-available results for this distribution in Spergel et 
al. (2003). After updating the trial distribution once, our second and 
final chain runs for 4000 steps with an acceptance rate of 26% be- 
fore achieving convergence with white noise power at large scales 
and r < 0.01 for all parameters. The values of the power spec- 
trum template variables are given in Table[3]for the most and least 
efficient parameters. The underlying distribution is fairly close to 
Gaussian. We are able to sample very efficiently, achieving an in- 
verse efficiency of ~ 40, compared to an optimal inverse efficiency 
of ~ 25 for 7 dimensions. We continued running the chain to 9000 
steps for improved statistics. Fie. ll6l shows the time series of three 
of the seven parameters, their power spectra and best-fit curves and 
the recovered marginalized posterior distributions. The smoothed 
posterior distributions are obtained by fitting the logarithm of the 
binned number densities to a high order polynomial, as described 
in Tegmark et al. (2003). 

We then extend the parameter space to 16 dimensions, includ- 
ing general isocurvature perturbations, requiring nine extra param- 
eters to quantify the additional mode contributions. As described 
in Bucher et al. (2004), we include the CDM isocurvature mode 
and the neutrino density and velocity isocurvature modes, creating 
a 4 x 4 symmetric mode matrix with the adiabatic mode and cross- 
correlations. Physically all the eigenvalues of this matrix must be 
non-negative. This constraint was imposed by assigning a zero like- 
lihood to those trial steps violating it. With very little prior knowl- 
edge of this probability distribution and a higher cost of sampling 
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AD 


AD+ISO 


AD+ISO 


best 


worst 


best 


worst 


P = 1/E 33 


39 


185 


498 


a 1.8 


1.8 


1.8 


1.8 


3* 90 


77 


92 


37 



Table 3. Best-fit template variables for the power spectra of chains sam- 
pling cosmological parameters. All chains are normalized to have unit vari- 
ance. The template parameters for the best (i.e., lowest 1/E) and worst 
cosmological parameters are indicated for both the adiabatic models (AD) 
and the mixed adiabatic/isocurvature models (AD+ISO). 

non-optimally, the trial distribution was updated more times to im- 
prove the efficiency. The chain converged after 6 x 10 4 steps, with 
Ti < 0.01 for all parameters. In this chain 60% of steps violated the 
constraint, so that only N moc i e is = 2.4 x 10 4 actual cosmological 
models were evaluated. In Fie. ll7l we show power spectra and the 
recovered distributions for four of the parameters, with the template 
variables given in Table [3] By taking an effective number of steps 
N mo deis = 24000 we find an inverse efficiency of 1/E = 240 
for the worst parameter. Since this distribution is non-Gaussian and 
of high dimension, it is not surprising that we do not achieve the 
optimal Gaussian inverse efficiency of 1/E = 53. 



9 DISCUSSION 

We have shown how a spectral test based on an empirical fitting 
function can be used to diagnose reliably the convergence proper- 
ties of a single long MCMC chain. Explicit criteria for convergence 
using this test were formulated and subsequently demonstrated to 
detect lack of convergence for a variety of underlying distributions. 
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A procedure to optimize the covariance matrix for a Gaussian trial 
distribution was also explored. 

The distributions for which our test did not successfully de- 
tect a lack of convergence were those with widely separated, nar- 
row multiple peaks of high probability density. Although for such 
distributions no universally valid procedure exists for uncovering 
the presence of multiple peaks, a possible method is to run multi- 
ple chains starting from widely separated, randomly chosen points 
in the sample space. The Gelman & Rubin test provides an appro- 
priate convergence diagnostic for multiple chains. The variance of 
the sample means of the several chains is compared to the estimated 
variance of the sampled distribution, obtained by averaging over the 
sample variance within each chain. In the case of multiple peaks or 
lack of convergence about a single peak, the variance of the sample 
means will be too high, failing to fall below a specified fraction of 
the within-chain sample variance. 

The spectral test applied to single chains has the advantage 
of exploiting all the information relevant to estimating the sam- 
ple mean variance. Rather than simply comparing a small number 
of parallel chains, the spectral test effectively divides the chain in 
2jmax independent ways when all the Pj with 1 ^ j ' ^ j mal are 
used to fit to the template. 

We find that the spectral test and trial distributions optimized 
according to the procedure outlined above work well when applied 
to CMB parameter estimation. As described in detail above, a chain 
of length N — 4 x 10 3 suffices to obtain r <, 0.01 (equiv- 
alent to a 10% standard deviation for the sample means of the 
cosmological parameters) for an adiabatic cosmological model, al- 
though longer chains were actually employed to reduce shot noise 
error in the outer edge of the region of high probability. For the 
full adiabatic/isocurvature model, convergence was attained within 
N = 6 x 10 4 steps, which involved calculating only 2.4 x 10 4 
distinct cosmological models due to large zero-likelihood regions. 
Neither model showed evidence of any of the pathologies explored 
in section 6. The chain performance attained may be compared to 
that expected from a Gaussian of the same dimension explored us- 
ing a chain with an optimal trial distribution. Our efficiencies were 
lower by factors of approximately 1.6 and 5 for the two cases, re- 
spectively. 
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